Data ini merupakan data pegawai sebuah perusahaan yang berisikan tentang informasi demografis dan metrik terkait pekerjaan. Berikut detail fitur pada data: https://github.com/dicodingacademy/dicoding_dataset/tree/main/employee
import pandas as pd
import numpy as np
from scipy import stats
import plotly.graph_objs as go
import plotly.express as px
from joblib import dump, load
import plotly.offline as pyo
# plotly offline
pyo.init_notebook_mode()
df = pd.read_csv('data/employee_data.csv')
df.head(5)
| EmployeeId | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 38 | NaN | Travel_Frequently | 1444 | Human Resources | 1 | 4 | Other | 1 | ... | 2 | 80 | 1 | 7 | 2 | 3 | 6 | 2 | 1 | 2 |
| 1 | 2 | 37 | 1.0 | Travel_Rarely | 1141 | Research & Development | 11 | 2 | Medical | 1 | ... | 1 | 80 | 0 | 15 | 2 | 1 | 1 | 0 | 0 | 0 |
| 2 | 3 | 51 | 1.0 | Travel_Rarely | 1323 | Research & Development | 4 | 4 | Life Sciences | 1 | ... | 3 | 80 | 3 | 18 | 2 | 4 | 10 | 0 | 2 | 7 |
| 3 | 4 | 42 | 0.0 | Travel_Frequently | 555 | Sales | 26 | 3 | Marketing | 1 | ... | 4 | 80 | 1 | 23 | 2 | 4 | 20 | 4 | 4 | 8 |
| 4 | 5 | 40 | NaN | Travel_Rarely | 1194 | Research & Development | 2 | 4 | Medical | 1 | ... | 2 | 80 | 3 | 20 | 2 | 3 | 5 | 3 | 0 | 2 |
5 rows × 35 columns
Berdasarkan data yang ada, tujuan dari proyek ini adalah:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1470 entries, 0 to 1469 Data columns (total 35 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 EmployeeId 1470 non-null int64 1 Age 1470 non-null int64 2 Attrition 1058 non-null float64 3 BusinessTravel 1470 non-null object 4 DailyRate 1470 non-null int64 5 Department 1470 non-null object 6 DistanceFromHome 1470 non-null int64 7 Education 1470 non-null int64 8 EducationField 1470 non-null object 9 EmployeeCount 1470 non-null int64 10 EnvironmentSatisfaction 1470 non-null int64 11 Gender 1470 non-null object 12 HourlyRate 1470 non-null int64 13 JobInvolvement 1470 non-null int64 14 JobLevel 1470 non-null int64 15 JobRole 1470 non-null object 16 JobSatisfaction 1470 non-null int64 17 MaritalStatus 1470 non-null object 18 MonthlyIncome 1470 non-null int64 19 MonthlyRate 1470 non-null int64 20 NumCompaniesWorked 1470 non-null int64 21 Over18 1470 non-null object 22 OverTime 1470 non-null object 23 PercentSalaryHike 1470 non-null int64 24 PerformanceRating 1470 non-null int64 25 RelationshipSatisfaction 1470 non-null int64 26 StandardHours 1470 non-null int64 27 StockOptionLevel 1470 non-null int64 28 TotalWorkingYears 1470 non-null int64 29 TrainingTimesLastYear 1470 non-null int64 30 WorkLifeBalance 1470 non-null int64 31 YearsAtCompany 1470 non-null int64 32 YearsInCurrentRole 1470 non-null int64 33 YearsSinceLastPromotion 1470 non-null int64 34 YearsWithCurrManager 1470 non-null int64 dtypes: float64(1), int64(26), object(8) memory usage: 402.1+ KB
# Jumlah data duplikat
print('Jumlah Data Duplikat: ', df.duplicated().sum())
Jumlah Data Duplikat: 0
EmployeeCount tidak memiliki keterangan fitur, maka kolom akan diselidiki lagi lebih lanjut untuk menentukan tindakan yang tepat.Attrition seharusnya hanya memiliki 2 nilai, yaitu 1(Yes) dan 0(No), maka kita akan mengubah tipe data yang semula float menjadi int.Attrition. Karena salah satu tujuan kita adalah untuk memprediksi apakah seorang karyawan akan keluar dari perusahaan atau tidak, maka kita akan memisahkan baris data yang memiliki nilai Null pada kolom Attrition dan menyimpannya pada variabel df_pred yang nantinya akan dilakukan prediksi.EmployeeId merupakan kolom yang tidak diperlukan untuk memprediksi apakah seorang karyawan akan keluar dari perusahaan atau tidak, maka kolom tersebut akan dihapus.# Nilai unik kolom EmployeeCount
print('Nilai unik kolom EmployeeCount: ', df['EmployeeCount'].unique())
Nilai unik kolom EmployeeCount: [1]
EmployeeCount hanya memiliki satu nilai yang membuat kolom ini tidak dapat digunakan dalam analysis dan pemodelan machine learning. Maka kolom ini akan dihapus.
# Menghapus EmployeeId dan EmployeeCount
df = df.drop(columns=['EmployeeCount', 'EmployeeId'])
# Meletakkan kolom target pada posisi paling depan untuk memudahkan memisahkan antara fitur dan target
df = df[['Attrition'] + [col for col in df.columns if col != 'Attrition']]
# Nilai Null pada kolom Attrition
print('Jumlah nilai NULL: ', df['Attrition'].isnull().sum())
print('Persentase nilai NULL: ', round(
df['Attrition'].isnull().sum()/len(df)*100, 2), '%')
Jumlah nilai NULL: 412 Persentase nilai NULL: 28.03 %
# Dataframe dengan nilai null pada kolom Attrition
df_pred = df[df['Attrition'].isnull()]
df_pred = df_pred.drop(['Attrition'], axis=1)
print(df_pred.shape)
df_pred.head(5)
(412, 32)
| Age | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | HourlyRate | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38 | Travel_Frequently | 1444 | Human Resources | 1 | 4 | Other | 4 | Male | 88 | ... | 2 | 80 | 1 | 7 | 2 | 3 | 6 | 2 | 1 | 2 |
| 4 | 40 | Travel_Rarely | 1194 | Research & Development | 2 | 4 | Medical | 3 | Female | 98 | ... | 2 | 80 | 3 | 20 | 2 | 3 | 5 | 3 | 0 | 2 |
| 5 | 29 | Travel_Rarely | 352 | Human Resources | 6 | 1 | Medical | 4 | Male | 87 | ... | 4 | 80 | 0 | 1 | 3 | 3 | 1 | 0 | 0 | 0 |
| 12 | 47 | Travel_Rarely | 571 | Sales | 14 | 3 | Medical | 3 | Female | 78 | ... | 3 | 80 | 1 | 11 | 4 | 2 | 5 | 4 | 1 | 2 |
| 18 | 25 | Travel_Frequently | 772 | Research & Development | 2 | 1 | Life Sciences | 4 | Male | 77 | ... | 3 | 80 | 2 | 7 | 6 | 3 | 7 | 7 | 0 | 7 |
5 rows × 32 columns
# Dataframe tanpa nilai null pada kolom Attrition
df = df.dropna(subset=['Attrition'])
print(df.shape)
df.head(5)
(1058, 33)
| Attrition | Age | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 37 | Travel_Rarely | 1141 | Research & Development | 11 | 2 | Medical | 1 | Female | ... | 1 | 80 | 0 | 15 | 2 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1.0 | 51 | Travel_Rarely | 1323 | Research & Development | 4 | 4 | Life Sciences | 1 | Male | ... | 3 | 80 | 3 | 18 | 2 | 4 | 10 | 0 | 2 | 7 |
| 3 | 0.0 | 42 | Travel_Frequently | 555 | Sales | 26 | 3 | Marketing | 3 | Female | ... | 4 | 80 | 1 | 23 | 2 | 4 | 20 | 4 | 4 | 8 |
| 6 | 0.0 | 40 | Travel_Rarely | 1124 | Sales | 1 | 2 | Medical | 2 | Male | ... | 3 | 80 | 3 | 6 | 2 | 2 | 4 | 3 | 0 | 2 |
| 7 | 1.0 | 55 | Travel_Rarely | 725 | Research & Development | 2 | 3 | Medical | 4 | Male | ... | 4 | 80 | 1 | 24 | 2 | 3 | 5 | 2 | 1 | 4 |
5 rows × 33 columns
# Mengubah tipe data kolom Attrition
df['Attrition'] = df['Attrition'].astype('int')
print(df['Attrition'].value_counts())
Attrition 0 879 1 179 Name: count, dtype: int64
# memisahkan data bertipe object dan numerik
df_obj = df.select_dtypes(include=['object'])
df_num = df.select_dtypes(include=['number'])
print(f'Jumlah Kolom Objek : {len(df_obj.columns)}')
print(f'Jumlah Kolom Numerik : {len(df_num.columns)}')
Jumlah Kolom Objek : 8 Jumlah Kolom Numerik : 25
df_num.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Attrition | 1058.0 | 0.169187 | 0.375094 | 0.0 | 0.00 | 0.0 | 0.0 | 1.0 |
| Age | 1058.0 | 37.055766 | 9.410421 | 18.0 | 30.00 | 36.0 | 43.0 | 60.0 |
| DailyRate | 1058.0 | 809.542533 | 408.478049 | 102.0 | 465.25 | 817.5 | 1168.5 | 1499.0 |
| DistanceFromHome | 1058.0 | 8.978261 | 8.040608 | 1.0 | 2.00 | 7.0 | 13.0 | 29.0 |
| Education | 1058.0 | 2.879017 | 1.031442 | 1.0 | 2.00 | 3.0 | 4.0 | 5.0 |
| EnvironmentSatisfaction | 1058.0 | 2.712665 | 1.092959 | 1.0 | 2.00 | 3.0 | 4.0 | 4.0 |
| HourlyRate | 1058.0 | 65.643667 | 20.324861 | 30.0 | 48.00 | 65.0 | 83.0 | 100.0 |
| JobInvolvement | 1058.0 | 2.737240 | 0.704730 | 1.0 | 2.00 | 3.0 | 3.0 | 4.0 |
| JobLevel | 1058.0 | 2.090737 | 1.136514 | 1.0 | 1.00 | 2.0 | 3.0 | 5.0 |
| JobSatisfaction | 1058.0 | 2.749527 | 1.104144 | 1.0 | 2.00 | 3.0 | 4.0 | 4.0 |
| MonthlyIncome | 1058.0 | 6625.945180 | 4837.695042 | 1009.0 | 2900.25 | 4903.5 | 8736.5 | 19999.0 |
| MonthlyRate | 1058.0 | 14267.281664 | 7048.077305 | 2094.0 | 8277.25 | 14201.0 | 20364.0 | 26999.0 |
| NumCompaniesWorked | 1058.0 | 2.706994 | 2.523526 | 0.0 | 1.00 | 2.0 | 4.0 | 9.0 |
| PercentSalaryHike | 1058.0 | 15.155955 | 3.641097 | 11.0 | 12.00 | 14.0 | 18.0 | 25.0 |
| PerformanceRating | 1058.0 | 3.150284 | 0.357518 | 3.0 | 3.00 | 3.0 | 3.0 | 4.0 |
| RelationshipSatisfaction | 1058.0 | 2.726843 | 1.090970 | 1.0 | 2.00 | 3.0 | 4.0 | 4.0 |
| StandardHours | 1058.0 | 80.000000 | 0.000000 | 80.0 | 80.00 | 80.0 | 80.0 | 80.0 |
| StockOptionLevel | 1058.0 | 0.762760 | 0.837537 | 0.0 | 0.00 | 1.0 | 1.0 | 3.0 |
| TotalWorkingYears | 1058.0 | 11.435728 | 8.016429 | 0.0 | 6.00 | 10.0 | 16.0 | 40.0 |
| TrainingTimesLastYear | 1058.0 | 2.769376 | 1.302689 | 0.0 | 2.00 | 3.0 | 3.0 | 6.0 |
| WorkLifeBalance | 1058.0 | 2.763705 | 0.707392 | 1.0 | 2.00 | 3.0 | 3.0 | 4.0 |
| YearsAtCompany | 1058.0 | 7.065217 | 6.265227 | 0.0 | 3.00 | 5.0 | 9.0 | 40.0 |
| YearsInCurrentRole | 1058.0 | 4.260870 | 3.617470 | 0.0 | 2.00 | 3.0 | 7.0 | 18.0 |
| YearsSinceLastPromotion | 1058.0 | 2.203214 | 3.266948 | 0.0 | 0.00 | 1.0 | 3.0 | 15.0 |
| YearsWithCurrManager | 1058.0 | 4.142722 | 3.599044 | 0.0 | 2.00 | 3.0 | 7.0 | 17.0 |
df_obj.describe().T
| count | unique | top | freq | |
|---|---|---|---|---|
| BusinessTravel | 1058 | 3 | Travel_Rarely | 746 |
| Department | 1058 | 3 | Research & Development | 701 |
| EducationField | 1058 | 6 | Life Sciences | 436 |
| Gender | 1058 | 2 | Male | 620 |
| JobRole | 1058 | 9 | Sales Executive | 232 |
| MaritalStatus | 1058 | 3 | Married | 464 |
| Over18 | 1058 | 1 | Y | 1058 |
| OverTime | 1058 | 2 | No | 751 |
# Mencari nilai unik pada setiap kolom bertipe object
for column in df_obj.columns:
print(f'Nilai unik {column} : {df_obj[column].unique()}')
Nilai unik BusinessTravel : ['Travel_Rarely' 'Travel_Frequently' 'Non-Travel'] Nilai unik Department : ['Research & Development' 'Sales' 'Human Resources'] Nilai unik EducationField : ['Medical' 'Life Sciences' 'Marketing' 'Technical Degree' 'Human Resources' 'Other'] Nilai unik Gender : ['Female' 'Male'] Nilai unik JobRole : ['Healthcare Representative' 'Research Scientist' 'Sales Executive' 'Manager' 'Laboratory Technician' 'Research Director' 'Manufacturing Director' 'Human Resources' 'Sales Representative'] Nilai unik MaritalStatus : ['Married' 'Single' 'Divorced'] Nilai unik Over18 : ['Y'] Nilai unik OverTime : ['No' 'Yes']
Over18 dan StadardHours hanya memiliki satu buah nilai, maka kolom tersebut akan dihapus.# Menghapus Over18 dan StandardHours
df_obj = df_obj.drop(columns=['Over18'])
df_num = df_num.drop(columns=['StandardHours'])
# Menghapus pada dataframe utama
df = df.drop(columns=['Over18', 'StandardHours'])
df_pred = df_pred.drop(columns=['Over18', 'StandardHours'])
# Visualisasi data kategorikal dengan target/label
for column in df_obj.columns:
fig = px.histogram(df, x=column, color="Attrition", barmode="group")
fig.update_layout(title=f'{column} vs. Attrition', xaxis_title=column,
yaxis_title="Count", bargap=0.2, width=800, height=500)
fig.show()
Masalah yang dapat disebabkan oleh outlier adalah:
# fungsi untuk menentukan outliers
def detect_outliers_zscore(column, threshold=3):
z_scores = stats.zscore(column)
outliers = (z_scores > threshold) | (z_scores < -threshold)
return outliers
# Mengubah nilai outliers dengan nilai rata-rata
def replace_outliers_with_median(df):
# Dictionary berisi statistik kolom numerik yang memiliki outliers
outlier_stats = {}
# Kolom numerik
num_cols = df_num.columns
for col in num_cols:
outliers = detect_outliers_zscore(df[col])
if outliers.sum() > 0:
median_value = df[col].median()
df[col] = np.where(outliers, median_value, df[col])
outlier_stats[col] = {
'num_outliers': int(outliers.sum()),
'mean': df[col].mean(),
'std': df[col].std(),
'min': df[col].min(),
'25%': df[col].quantile(0.25),
'50%': df[col].median(),
'75%': df[col].quantile(0.75),
'max': df[col].max(),
}
# Menampilkan statistik kolom yang memiliki outliers
if len(outlier_stats) > 0:
df_outliers = pd.DataFrame(outlier_stats).T
df_outliers.index.name = 'column_name'
print('Pasca penggantian nilai outliers dengan median: ')
display(df_outliers)
return df
else:
print('Tidak terdapat outliers')
return df
outliers_stat = replace_outliers_with_median(df)
Pasca penggantian nilai outliers dengan mean:
| num_outliers | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| column_name | ||||||||
| TotalWorkingYears | 10.0 | 11.177694 | 7.606985 | 0.0 | 6.0 | 10.0 | 15.0 | 35.0 |
| YearsAtCompany | 20.0 | 6.568053 | 5.263681 | 0.0 | 3.0 | 5.0 | 9.0 | 25.0 |
| YearsInCurrentRole | 10.0 | 4.132325 | 3.411466 | 0.0 | 2.0 | 3.0 | 7.0 | 15.0 |
| YearsSinceLastPromotion | 29.0 | 1.852552 | 2.619879 | 0.0 | 0.0 | 1.0 | 2.0 | 11.0 |
| YearsWithCurrManager | 12.0 | 3.994329 | 3.364060 | 0.0 | 2.0 | 3.0 | 7.0 | 14.0 |
Terdapat beberapa teknik lain untuk menentukan sebuah outlier salah satunya adalah IQR (Interquartile Range). Selain itu penanganan outlier juga bermacam-macam, seperti menghapus outlier atau mengubah menjadi nilai modus, mean, atau menghapus outlier tersebut.
df_obj = df.select_dtypes(include=['object'])
df_num = df.select_dtypes(include=['number'])
# Membuat korelasi antar kolom numerik
corr_matrix = df_num.corr(method='pearson')
# Create the heatmap trace
heatmap = go.Heatmap(
z=corr_matrix.values,
x=corr_matrix.index.values,
y=corr_matrix.columns.values,
colorscale='Viridis',
text=corr_matrix.round(2),
hovertemplate='Feature 1: %{y}<br>' +
'Feature 2: %{x}<br>' +
'Correlation: %{z:.2f}<br>',
)
# Create the layout object
layout = go.Layout(
title='Numeric Correlation Heatmap',
xaxis=dict(title='Features'),
yaxis=dict(title='Features'),
autosize=False,
height=1200,
width=1200,
margin=dict(l=50, r=50, t=130, b=130),
)
# Create the figure object
fig = go.Figure(data=[heatmap], layout=layout)
# Add the correlation values to the heatmap
fig.update_traces(texttemplate='%{text:.2f}')
# Show the figure
fig.show()
Attrition.WorkLifeBalance tidak memiliki pengaruh yang signifikan terhadap Attrition dan pada fitur lainnya.Masih terdapat banyak relasi antar fitur yang dapat dianalisis lebih lanjut. Coba lakukan analisis lebih lanjut untuk menemukan relasi antar fitur yang lain menggunakan chart lain seperti scatter plot, box plot, dan pie chart.
# Save Dataframe
dump(df, 'joblib/df.joblib')
dump(df_pred, 'joblib/df_pred.joblib')
['joblib/df_pred.joblib']